The UCI ML Breast Cancer Wisconsin (Diagnostic) dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe characteristics of cell nuclei present in the image. The target variable is binary, diagnosing whether the mass is malignant or benign.
Refer to the documentation for the load_breast_cancer function in the scikit-learn library for more information about this dataset and instructions on how to load it directly.
The objective of this project is to apply Principal Component Analysis (PCA) to reduce dimensionality and select the most relevant components, enhancing model performance while retaining essential information for accurate breast cancer classification.
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
data.target[[10, 50, 85]]
list(data.target_names)
['malignant', 'benign']
print(data.DESCR)
.. _breast_cancer_dataset:
Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------
**Data Set Characteristics:**
:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class
:Attribute Information:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three
worst/largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 0 is Mean Radius, field
10 is Radius SE, field 20 is Worst Radius.
- class:
- WDBC-Malignant
- WDBC-Benign
:Summary Statistics:
===================================== ====== ======
Min Max
===================================== ====== ======
radius (mean): 6.981 28.11
texture (mean): 9.71 39.28
perimeter (mean): 43.79 188.5
area (mean): 143.5 2501.0
smoothness (mean): 0.053 0.163
compactness (mean): 0.019 0.345
concavity (mean): 0.0 0.427
concave points (mean): 0.0 0.201
symmetry (mean): 0.106 0.304
fractal dimension (mean): 0.05 0.097
radius (standard error): 0.112 2.873
texture (standard error): 0.36 4.885
perimeter (standard error): 0.757 21.98
area (standard error): 6.802 542.2
smoothness (standard error): 0.002 0.031
compactness (standard error): 0.002 0.135
concavity (standard error): 0.0 0.396
concave points (standard error): 0.0 0.053
symmetry (standard error): 0.008 0.079
fractal dimension (standard error): 0.001 0.03
radius (worst): 7.93 36.04
texture (worst): 12.02 49.54
perimeter (worst): 50.41 251.2
area (worst): 185.2 4254.0
smoothness (worst): 0.071 0.223
compactness (worst): 0.027 1.058
concavity (worst): 0.0 1.252
concave points (worst): 0.0 0.291
symmetry (worst): 0.156 0.664
fractal dimension (worst): 0.055 0.208
===================================== ====== ======
:Missing Attribute Values: None
:Class Distribution: 212 - Malignant, 357 - Benign
:Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
:Donor: Nick Street
:Date: November, 1995
This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2
Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.
The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].
This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/
**References**
- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
San Jose, CA, 1993.
- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
prognosis via linear programming. Operations Research, 43(4), pages 570-577,
July-August 1995.
- W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
163-171.
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
display(df.head().T)
print(f"This dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| mean radius | 17.990000 | 20.570000 | 19.690000 | 11.420000 | 20.290000 |
| mean texture | 10.380000 | 17.770000 | 21.250000 | 20.380000 | 14.340000 |
| mean perimeter | 122.800000 | 132.900000 | 130.000000 | 77.580000 | 135.100000 |
| mean area | 1001.000000 | 1326.000000 | 1203.000000 | 386.100000 | 1297.000000 |
| mean smoothness | 0.118400 | 0.084740 | 0.109600 | 0.142500 | 0.100300 |
| mean compactness | 0.277600 | 0.078640 | 0.159900 | 0.283900 | 0.132800 |
| mean concavity | 0.300100 | 0.086900 | 0.197400 | 0.241400 | 0.198000 |
| mean concave points | 0.147100 | 0.070170 | 0.127900 | 0.105200 | 0.104300 |
| mean symmetry | 0.241900 | 0.181200 | 0.206900 | 0.259700 | 0.180900 |
| mean fractal dimension | 0.078710 | 0.056670 | 0.059990 | 0.097440 | 0.058830 |
| radius error | 1.095000 | 0.543500 | 0.745600 | 0.495600 | 0.757200 |
| texture error | 0.905300 | 0.733900 | 0.786900 | 1.156000 | 0.781300 |
| perimeter error | 8.589000 | 3.398000 | 4.585000 | 3.445000 | 5.438000 |
| area error | 153.400000 | 74.080000 | 94.030000 | 27.230000 | 94.440000 |
| smoothness error | 0.006399 | 0.005225 | 0.006150 | 0.009110 | 0.011490 |
| compactness error | 0.049040 | 0.013080 | 0.040060 | 0.074580 | 0.024610 |
| concavity error | 0.053730 | 0.018600 | 0.038320 | 0.056610 | 0.056880 |
| concave points error | 0.015870 | 0.013400 | 0.020580 | 0.018670 | 0.018850 |
| symmetry error | 0.030030 | 0.013890 | 0.022500 | 0.059630 | 0.017560 |
| fractal dimension error | 0.006193 | 0.003532 | 0.004571 | 0.009208 | 0.005115 |
| worst radius | 25.380000 | 24.990000 | 23.570000 | 14.910000 | 22.540000 |
| worst texture | 17.330000 | 23.410000 | 25.530000 | 26.500000 | 16.670000 |
| worst perimeter | 184.600000 | 158.800000 | 152.500000 | 98.870000 | 152.200000 |
| worst area | 2019.000000 | 1956.000000 | 1709.000000 | 567.700000 | 1575.000000 |
| worst smoothness | 0.162200 | 0.123800 | 0.144400 | 0.209800 | 0.137400 |
| worst compactness | 0.665600 | 0.186600 | 0.424500 | 0.866300 | 0.205000 |
| worst concavity | 0.711900 | 0.241600 | 0.450400 | 0.686900 | 0.400000 |
| worst concave points | 0.265400 | 0.186000 | 0.243000 | 0.257500 | 0.162500 |
| worst symmetry | 0.460100 | 0.275000 | 0.361300 | 0.663800 | 0.236400 |
| worst fractal dimension | 0.118900 | 0.089020 | 0.087580 | 0.173000 | 0.076780 |
| target | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
This dataset contains 569 rows and 31 columns.
df.columns
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error', 'fractal dimension error',
'worst radius', 'worst texture', 'worst perimeter', 'worst area',
'worst smoothness', 'worst compactness', 'worst concavity',
'worst concave points', 'worst symmetry', 'worst fractal dimension',
'target'],
dtype='object')
df.isnull().sum()
All 31 columns report 0 missing values (dtype: int64).
The dataset has 30 numeric features and one binary target, with no missing values.
# sns.pairplot builds its own figure, so a preceding plt.figure() call
# would only create the empty "<Figure ... with 0 Axes>" artifact
sns.pairplot(df.drop('target', axis=1),
             diag_kind='hist', corner=True, diag_kws={'color': 'grey'})
plt.show()
The pair plot shows that several features contain extreme values (outliers) and that some feature pairs are strongly linearly related.
df.hist(figsize=(12, 10), bins=30, edgecolor="black")
plt.subplots_adjust(hspace=0.7, wspace=0.4)
The target variable takes two values, 0 and 1. Most features have long-tailed (right-skewed) distributions.
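Because most features are right-skewed, a log transform can make their distributions more symmetric. The analysis below relies on standardization instead, so the following is only an optional sketch (the choice of `np.log1p` is an assumption, not part of this notebook's pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# log1p = log(1 + x) is safe for zero-valued entries such as 'mean concavity'
log_df = np.log1p(df)

# skewness drops noticeably for a long-tailed feature like 'mean area'
print(df['mean area'].skew(), log_df['mean area'].skew())
```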
# Response variable
y = df['target']
# Explanatory variables
X = df.drop('target', axis=1)
plt.figure(figsize=(8, 6))
sns.countplot(x='target', hue='target', data=df, palette='Set2', legend=False)
plt.title('Bar Chart of Target Variable')
plt.xlabel('Target')
plt.ylabel('Count')
plt.xticks([0, 1], ['0', '1']) # If your target variable has string labels, you can use this line to set the tick labels
plt.show()
The classes are imbalanced (212 malignant vs. 357 benign), so balancing the data could improve downstream model performance.
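One simple remedy, not applied in this notebook, is to upsample the minority (malignant) class with `sklearn.utils.resample`; a minimal sketch:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.utils import resample

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# target 0 = malignant (212 rows) is the minority class
minority = df[df['target'] == 0]
majority = df[df['target'] == 1]

# sample the minority class with replacement until it matches the majority
minority_up = resample(minority, replace=True, n_samples=len(majority),
                       random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced['target'].value_counts())  # 357 of each class
```

To avoid leaking duplicated rows into evaluation, any such resampling should happen after the train/test split, on the training set only.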
num_cols = X.columns
num_cols
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error', 'fractal dimension error',
'worst radius', 'worst texture', 'worst perimeter', 'worst area',
'worst smoothness', 'worst compactness', 'worst concavity',
'worst concave points', 'worst symmetry', 'worst fractal dimension'],
dtype='object')
# Box plots for outlier detection
ncols = 5
nrows = int(len(num_cols) / ncols) + (len(num_cols) % ncols > 0)
plt.figure(figsize=(20, nrows * 4))
for i, col in enumerate(num_cols, 1):
    plt.subplot(nrows, ncols, i)
    sns.boxplot(y=df[col])
    plt.title(col)
    plt.xticks([])
plt.tight_layout()
plt.show()
cov_matrix = np.cov(X, rowvar=False)
inv_cov_matrix = np.linalg.inv(cov_matrix)
mean_vector = X.mean(axis=0)
# Function to compute Mahalanobis distance for each row in the dataset
def mahalanobis_distance(row, mean_vector, inv_cov_matrix):
    diff = row - mean_vector
    md = np.sqrt(np.dot(np.dot(diff, inv_cov_matrix), diff.T))
    return md
X['mahalanobis_distance'] = X.apply(lambda row: mahalanobis_distance(row, mean_vector, inv_cov_matrix), axis=1)
# chi2.ppf gives a cutoff for the *squared* distance, so take the square root;
# X.shape[1] - 1 = 30 features here (the mahalanobis_distance column was just added)
threshold = np.sqrt(chi2.ppf(0.99, df=X.shape[1] - 1))  # 99% confidence level
# Identify outliers
X['is_outlier'] = X['mahalanobis_distance'] > threshold
outliers = X[X['is_outlier']]
print("Number of outliers detected:", outliers.shape[0])
print(outliers)
# Visualize the Mahalanobis distance
plt.figure(figsize=(10, 6))
sns.histplot(X['mahalanobis_distance'], bins=30, kde=True)
plt.axvline(x=threshold, color='red', linestyle='--', label='Threshold')
plt.title('Distribution of Mahalanobis Distance')
plt.xlabel('Mahalanobis Distance')
plt.ylabel('Frequency')
plt.legend()
plt.show()
Number of outliers detected: 0
Empty DataFrame (0 rows x 32 columns)
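As a sanity check, the manual computation can be compared against `scipy.spatial.distance.mahalanobis` (imported above but otherwise unused); the two should agree to numerical precision:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

cov_matrix = np.cov(X, rowvar=False)
inv_cov_matrix = np.linalg.inv(cov_matrix)
mean_vector = X.mean(axis=0).to_numpy()

# both compute sqrt((x - mu)^T S^-1 (x - mu)) for the first row
row0 = X.iloc[0].to_numpy()
diff = row0 - mean_vector
manual = np.sqrt(diff @ inv_cov_matrix @ diff)
scipy_md = mahalanobis(row0, mean_vector, inv_cov_matrix)
print(np.isclose(manual, scipy_md))  # True
```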
plt.figure(figsize=(13, 10))
corr = X.corr() # correlation
mask = np.triu(np.ones_like(corr, dtype=bool)) # masking the upper triangle
np.fill_diagonal(mask, False) # diagonal 1s
# heatmap
sns.heatmap(corr, annot=True, cmap='Blues', mask=mask)
plt.yticks(rotation=0)
plt.show()
strong_positive_corr = corr[(corr > 0.8) & (corr < 1)]
sns.heatmap(strong_positive_corr, annot=True, cmap='Blues', mask=mask)
plt.yticks(rotation=0)
plt.show()
print("Strong Positive Correlations (coefficient > 0.8):")
print(strong_positive_corr.dropna(how='all', axis=0).dropna(how='all', axis=1))
Strong Positive Correlations (coefficient > 0.8):
mean radius mean texture mean perimeter mean area \
mean radius NaN NaN 0.997855 0.987357
mean texture NaN NaN NaN NaN
mean perimeter 0.997855 NaN NaN 0.986507
mean area 0.987357 NaN 0.986507 NaN
mean smoothness NaN NaN NaN NaN
mean compactness NaN NaN NaN NaN
mean concavity NaN NaN NaN NaN
mean concave points 0.822529 NaN 0.850977 0.823269
radius error NaN NaN NaN NaN
perimeter error NaN NaN NaN NaN
area error NaN NaN NaN 0.800086
compactness error NaN NaN NaN NaN
concavity error NaN NaN NaN NaN
fractal dimension error NaN NaN NaN NaN
worst radius 0.969539 NaN 0.969476 0.962746
worst texture NaN 0.912045 NaN NaN
worst perimeter 0.965137 NaN 0.970387 0.959120
worst area 0.941082 NaN 0.941550 0.959213
worst smoothness NaN NaN NaN NaN
worst compactness NaN NaN NaN NaN
worst concavity NaN NaN NaN NaN
worst concave points NaN NaN NaN NaN
worst fractal dimension NaN NaN NaN NaN
mean smoothness mean compactness mean concavity \
mean radius NaN NaN NaN
mean texture NaN NaN NaN
mean perimeter NaN NaN NaN
mean area NaN NaN NaN
mean smoothness NaN NaN NaN
mean compactness NaN NaN 0.883121
mean concavity NaN 0.883121 NaN
mean concave points NaN 0.831135 0.921391
radius error NaN NaN NaN
perimeter error NaN NaN NaN
area error NaN NaN NaN
compactness error NaN NaN NaN
concavity error NaN NaN NaN
fractal dimension error NaN NaN NaN
worst radius NaN NaN NaN
worst texture NaN NaN NaN
worst perimeter NaN NaN NaN
worst area NaN NaN NaN
worst smoothness 0.805324 NaN NaN
worst compactness NaN 0.865809 NaN
worst concavity NaN 0.816275 0.884103
worst concave points NaN 0.815573 0.861323
worst fractal dimension NaN NaN NaN
mean concave points radius error perimeter error \
mean radius 0.822529 NaN NaN
mean texture NaN NaN NaN
mean perimeter 0.850977 NaN NaN
mean area 0.823269 NaN NaN
mean smoothness NaN NaN NaN
mean compactness 0.831135 NaN NaN
mean concavity 0.921391 NaN NaN
mean concave points NaN NaN NaN
radius error NaN NaN 0.972794
perimeter error NaN 0.972794 NaN
area error NaN 0.951830 0.937655
compactness error NaN NaN NaN
concavity error NaN NaN NaN
fractal dimension error NaN NaN NaN
worst radius 0.830318 NaN NaN
worst texture NaN NaN NaN
worst perimeter 0.855923 NaN NaN
worst area 0.809630 NaN NaN
worst smoothness NaN NaN NaN
worst compactness NaN NaN NaN
worst concavity NaN NaN NaN
worst concave points 0.910155 NaN NaN
worst fractal dimension NaN NaN NaN
... fractal dimension error worst radius \
mean radius ... NaN 0.969539
mean texture ... NaN NaN
mean perimeter ... NaN 0.969476
mean area ... NaN 0.962746
mean smoothness ... NaN NaN
mean compactness ... NaN NaN
mean concavity ... NaN NaN
mean concave points ... NaN 0.830318
radius error ... NaN NaN
perimeter error ... NaN NaN
area error ... NaN NaN
compactness error ... 0.803269 NaN
concavity error ... NaN NaN
fractal dimension error ... NaN NaN
worst radius ... NaN NaN
worst texture ... NaN NaN
worst perimeter ... NaN 0.993708
worst area ... NaN 0.984015
worst smoothness ... NaN NaN
worst compactness ... NaN NaN
worst concavity ... NaN NaN
worst concave points ... NaN NaN
worst fractal dimension ... NaN NaN
worst texture worst perimeter worst area \
mean radius NaN 0.965137 0.941082
mean texture 0.912045 NaN NaN
mean perimeter NaN 0.970387 0.941550
mean area NaN 0.959120 0.959213
mean smoothness NaN NaN NaN
mean compactness NaN NaN NaN
mean concavity NaN NaN NaN
mean concave points NaN 0.855923 0.809630
radius error NaN NaN NaN
perimeter error NaN NaN NaN
area error NaN NaN 0.811408
compactness error NaN NaN NaN
concavity error NaN NaN NaN
fractal dimension error NaN NaN NaN
worst radius NaN 0.993708 0.984015
worst texture NaN NaN NaN
worst perimeter NaN NaN 0.977578
worst area NaN 0.977578 NaN
worst smoothness NaN NaN NaN
worst compactness NaN NaN NaN
worst concavity NaN NaN NaN
worst concave points NaN 0.816322 NaN
worst fractal dimension NaN NaN NaN
worst smoothness worst compactness worst concavity \
mean radius NaN NaN NaN
mean texture NaN NaN NaN
mean perimeter NaN NaN NaN
mean area NaN NaN NaN
mean smoothness 0.805324 NaN NaN
mean compactness NaN 0.865809 0.816275
mean concavity NaN NaN 0.884103
mean concave points NaN NaN NaN
radius error NaN NaN NaN
perimeter error NaN NaN NaN
area error NaN NaN NaN
compactness error NaN NaN NaN
concavity error NaN NaN NaN
fractal dimension error NaN NaN NaN
worst radius NaN NaN NaN
worst texture NaN NaN NaN
worst perimeter NaN NaN NaN
worst area NaN NaN NaN
worst smoothness NaN NaN NaN
worst compactness NaN NaN 0.892261
worst concavity NaN 0.892261 NaN
worst concave points NaN 0.801080 0.855434
worst fractal dimension NaN 0.810455 NaN
worst concave points worst fractal dimension
mean radius NaN NaN
mean texture NaN NaN
mean perimeter NaN NaN
mean area NaN NaN
mean smoothness NaN NaN
mean compactness 0.815573 NaN
mean concavity 0.861323 NaN
mean concave points 0.910155 NaN
radius error NaN NaN
perimeter error NaN NaN
area error NaN NaN
compactness error NaN NaN
concavity error NaN NaN
fractal dimension error NaN NaN
worst radius NaN NaN
worst texture NaN NaN
worst perimeter 0.816322 NaN
worst area NaN NaN
worst smoothness NaN NaN
worst compactness 0.801080 0.810455
worst concavity 0.855434 NaN
worst concave points NaN NaN
worst fractal dimension NaN NaN
[23 rows x 23 columns]
strong_negative_corr = corr[corr < -0.8]
sns.heatmap(strong_negative_corr, annot=True, cmap='Blues', mask=mask)
plt.yticks(rotation=0)
plt.show()
print("Strong Negative Correlations (coefficient < -0.8):")
print(strong_negative_corr.dropna(how='all', axis=0).dropna(how='all', axis=1))
Strong Negative Correlations (coefficient < -0.8):
Empty DataFrame
Columns: []
Index: []
Use the scree plot, cumulative variance (targeting at least 80% of total variance), and Kaiser's criterion to determine the optimal number of principal components.
X.drop(['mahalanobis_distance', 'is_outlier'], axis=1, inplace=True)
X.head()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 30 columns
X.shape
(569, 30)
# Standardize the data (PCA is sensitive to feature scale)
sc = StandardScaler()
sc_X = sc.fit_transform(X)
sc_X = pd.DataFrame(sc_X, columns=X.columns)
sc_X.head()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.097064 | -2.073335 | 1.269934 | 0.984375 | 1.568466 | 3.283515 | 2.652874 | 2.532475 | 2.217515 | 2.255747 | ... | 1.886690 | -1.359293 | 2.303601 | 2.001237 | 1.307686 | 2.616665 | 2.109526 | 2.296076 | 2.750622 | 1.937015 |
| 1 | 1.829821 | -0.353632 | 1.685955 | 1.908708 | -0.826962 | -0.487072 | -0.023846 | 0.548144 | 0.001392 | -0.868652 | ... | 1.805927 | -0.369203 | 1.535126 | 1.890489 | -0.375612 | -0.430444 | -0.146749 | 1.087084 | -0.243890 | 0.281190 |
| 2 | 1.579888 | 0.456187 | 1.566503 | 1.558884 | 0.942210 | 1.052926 | 1.363478 | 2.037231 | 0.939685 | -0.398008 | ... | 1.511870 | -0.023974 | 1.347475 | 1.456285 | 0.527407 | 1.082932 | 0.854974 | 1.955000 | 1.152255 | 0.201391 |
| 3 | -0.768909 | 0.253732 | -0.592687 | -0.764464 | 3.283553 | 3.402909 | 1.915897 | 1.451707 | 2.867383 | 4.910919 | ... | -0.281464 | 0.133984 | -0.249939 | -0.550021 | 3.394275 | 3.893397 | 1.989588 | 2.175786 | 6.046041 | 4.935010 |
| 4 | 1.750297 | -1.151816 | 1.776573 | 1.826229 | 0.280372 | 0.539340 | 1.371011 | 1.428493 | -0.009560 | -0.562450 | ... | 1.298575 | -1.466770 | 1.338539 | 1.220724 | 0.220556 | -0.313395 | 0.613179 | 0.729259 | -0.868353 | -0.397100 |
5 rows × 30 columns
# PCA
pca = PCA()
pcs = pca.fit_transform(sc_X)
pcs_df = pd.DataFrame(pcs, columns=[f'PC{i+1}' for i in range(pcs.shape[1])])
pcs_df.head()
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | ... | PC21 | PC22 | PC23 | PC24 | PC25 | PC26 | PC27 | PC28 | PC29 | PC30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.192837 | 1.948583 | -1.123166 | 3.633731 | -1.195110 | 1.411424 | 2.159370 | -0.398407 | -0.157118 | -0.877402 | ... | 0.096515 | 0.068850 | 0.084519 | -0.175256 | -0.151020 | -0.201503 | -0.252585 | -0.033914 | 0.045648 | -0.047169 |
| 1 | 2.387802 | -3.768172 | -0.529293 | 1.118264 | 0.621775 | 0.028656 | 0.013358 | 0.240988 | -0.711905 | 1.106995 | ... | -0.077327 | -0.094578 | -0.217718 | 0.011290 | -0.170510 | -0.041129 | 0.181270 | 0.032624 | -0.005687 | -0.001868 |
| 2 | 5.733896 | -1.075174 | -0.551748 | 0.912083 | -0.177086 | 0.541452 | -0.668166 | 0.097374 | 0.024066 | 0.454275 | ... | 0.311067 | -0.060309 | -0.074291 | 0.102762 | 0.171158 | 0.004735 | 0.049569 | 0.047026 | 0.003146 | 0.000751 |
| 3 | 7.122953 | 10.275589 | -3.232790 | 0.152547 | -2.960878 | 3.053422 | 1.429911 | 1.059565 | -1.405440 | -1.116975 | ... | 0.434193 | -0.203266 | -0.124105 | 0.153430 | 0.077496 | -0.275225 | 0.183462 | 0.042484 | -0.069295 | -0.019937 |
| 4 | 3.935302 | -1.948072 | 1.389767 | 2.940639 | 0.546747 | -1.226495 | -0.936213 | 0.636376 | -0.263805 | 0.377704 | ... | -0.116545 | -0.017650 | 0.139454 | -0.005332 | 0.003062 | 0.039254 | 0.032168 | -0.034786 | 0.005038 | 0.021214 |
5 rows × 30 columns
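As a sanity check, the scores returned by `fit_transform` are just the centered data projected onto the component axes. A minimal, self-contained sketch (it re-derives the standardized matrix from the raw data, since the scaling cell appears earlier in the notebook):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features and fit PCA, as in the cells above
X_std = StandardScaler().fit_transform(load_breast_cancer().data)
pca = PCA()
scores = pca.fit_transform(X_std)

# transform() centers the data and projects it onto the loadings:
# scores = (X - mean_) @ components_.T
manual_scores = (X_std - pca.mean_) @ pca.components_.T
print(np.allclose(scores, manual_scores))  # True
```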
# Scree plot
var_pct = np.round(pca.explained_variance_ratio_ * 100, decimals = 1)
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(var_pct) + 1), var_pct, alpha=0.8)
plt.plot(range(1, len(var_pct) + 1), var_pct, color='k', linestyle=':', marker='o')
plt.xticks(range(1, len(var_pct) + 1))
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance (%)')
plt.show()
pca1 = PCA(0.8)  # keep enough components to explain at least 80% of the variance
pca_result1 = pca1.fit_transform(sc_X)
pca1.n_components_
5
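When `n_components` is a float in (0, 1), scikit-learn keeps the smallest number of components whose cumulative explained variance ratio reaches that fraction. A quick sketch confirming this matches a manual count:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_breast_cancer().data)

# Manual count from a full PCA fit
full = PCA().fit(X_std)
cumvar = np.cumsum(full.explained_variance_ratio_)
manual = int(np.argmax(cumvar >= 0.8)) + 1

# Letting PCA pick the count from the variance threshold
auto = PCA(n_components=0.8).fit(X_std)
print(manual, auto.n_components_)  # 5 5
```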
##### Kaiser criterion
eigenvalues = pca.explained_variance_
kaiser = sum(eigenvalues > 1)  # number of components with eigenvalue > 1
# Plot eigenvalues
plt.bar(range(1, len(eigenvalues) + 1), eigenvalues, alpha=0.7)
plt.axhline(y=1, color='k', linestyle=':', marker='o', label='Eigenvalue = 1 (Kaiser Criterion)')
plt.xticks(range(1, len(eigenvalues) + 1))
plt.ylabel("Eigenvalues")
plt.xlabel("Component #")
plt.legend()
plt.title("Scree Plot")
plt.show()
explained_variance_ratio = eigenvalues / np.sum(eigenvalues)
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
components_needed = np.argmax(cumulative_variance_ratio >= 0.8) + 1
print("Number of principal components needed to explain at least 80% of the total variance:", components_needed)
Number of principal components needed to explain at least 80% of the total variance: 5
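The ratio computed by hand from the eigenvalues should match the attribute scikit-learn already exposes. A quick check (a sketch):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_breast_cancer().data)
pca = PCA().fit(X_std)

# explained_variance_ratio_ is just the eigenvalues normalized to sum to 1
ratio = pca.explained_variance_ / pca.explained_variance_.sum()
print(np.allclose(ratio, pca.explained_variance_ratio_))  # True
```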
Answer:
Using the scree plot, cumulative variance (targeting at least 80% of total variance), and Kaiser's criterion, the optimal number of principal components is 5.
# loadings
loadings = pd.DataFrame(pca.components_.T[:, :5], columns = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5'], index = X.columns)
loadings
| PC1 | PC2 | PC3 | PC4 | PC5 | |
|---|---|---|---|---|---|
| mean radius | 0.218902 | -0.233857 | -0.008531 | 0.041409 | 0.037786 |
| mean texture | 0.103725 | -0.059706 | 0.064550 | -0.603050 | -0.049469 |
| mean perimeter | 0.227537 | -0.215181 | -0.009314 | 0.041983 | 0.037375 |
| mean area | 0.220995 | -0.231077 | 0.028700 | 0.053434 | 0.010331 |
| mean smoothness | 0.142590 | 0.186113 | -0.104292 | 0.159383 | -0.365089 |
| mean compactness | 0.239285 | 0.151892 | -0.074092 | 0.031795 | 0.011704 |
| mean concavity | 0.258400 | 0.060165 | 0.002734 | 0.019123 | 0.086375 |
| mean concave points | 0.260854 | -0.034768 | -0.025564 | 0.065336 | -0.043861 |
| mean symmetry | 0.138167 | 0.190349 | -0.040240 | 0.067125 | -0.305941 |
| mean fractal dimension | 0.064363 | 0.366575 | -0.022574 | 0.048587 | -0.044424 |
| radius error | 0.205979 | -0.105552 | 0.268481 | 0.097941 | -0.154456 |
| texture error | 0.017428 | 0.089980 | 0.374634 | -0.359856 | -0.191651 |
| perimeter error | 0.211326 | -0.089457 | 0.266645 | 0.088992 | -0.120990 |
| area error | 0.202870 | -0.152293 | 0.216007 | 0.108205 | -0.127574 |
| smoothness error | 0.014531 | 0.204430 | 0.308839 | 0.044664 | -0.232066 |
| compactness error | 0.170393 | 0.232716 | 0.154780 | -0.027469 | 0.279968 |
| concavity error | 0.153590 | 0.197207 | 0.176464 | 0.001317 | 0.353982 |
| concave points error | 0.183417 | 0.130322 | 0.224658 | 0.074067 | 0.195548 |
| symmetry error | 0.042498 | 0.183848 | 0.288584 | 0.044073 | -0.252869 |
| fractal dimension error | 0.102568 | 0.280092 | 0.211504 | 0.015305 | 0.263297 |
| worst radius | 0.227997 | -0.219866 | -0.047507 | 0.015417 | -0.004407 |
| worst texture | 0.104469 | -0.045467 | -0.042298 | -0.632808 | -0.092883 |
| worst perimeter | 0.236640 | -0.199878 | -0.048547 | 0.013803 | 0.007454 |
| worst area | 0.224871 | -0.219352 | -0.011902 | 0.025895 | -0.027391 |
| worst smoothness | 0.127953 | 0.172304 | -0.259798 | 0.017652 | -0.324435 |
| worst compactness | 0.210096 | 0.143593 | -0.236076 | -0.091328 | 0.121804 |
| worst concavity | 0.228768 | 0.097964 | -0.173057 | -0.073951 | 0.188519 |
| worst concave points | 0.250886 | -0.008257 | -0.170344 | 0.006007 | 0.043332 |
| worst symmetry | 0.122905 | 0.141883 | -0.271313 | -0.036251 | -0.244559 |
| worst fractal dimension | 0.131784 | 0.275339 | -0.232791 | -0.077053 | 0.094423 |
cumulative_variance_ratio = np.sum(pca.explained_variance_ratio_[:2])
print("Cumulative explained variance ratio up to 2 components:", cumulative_variance_ratio)
Cumulative explained variance ratio up to 2 components: 0.6324320765155949
print("Explained Variance Ratio for PC1 and PC2:", explained_variance_ratio[:2])
Explained Variance Ratio for PC1 and PC2: [0.44272026 0.18971182]
pca1 = PCA(n_components=2)
principalComponents = pca1.fit_transform(sc_X)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])
principalDf.head(5)
| principal component 1 | principal component 2 | |
|---|---|---|
| 0 | 9.192837 | 1.948583 |
| 1 | 2.387802 | -3.768172 |
| 2 | 5.733896 | -1.075174 |
| 3 | 7.122953 | 10.275589 |
| 4 | 3.935302 | -1.948072 |
finalDf = pd.concat([principalDf, df[['target']]], axis = 1)
finalDf.head(5)
| principal component 1 | principal component 2 | target | |
|---|---|---|---|
| 0 | 9.192837 | 1.948583 | 0 |
| 1 | 2.387802 | -3.768172 | 0 |
| 2 | 5.733896 | -1.075174 | 0 |
| 3 | 7.122953 | 10.275589 | 0 |
| 4 | 3.935302 | -1.948072 | 0 |
sns.lmplot(x='principal component 1',y='principal component 2',
data=finalDf, hue = 'target' ,fit_reg=False,
height=6, aspect=1)
<seaborn.axisgrid.FacetGrid at 0x155972310>
The scatter plot shows that the two classes separate well in the PC1–PC2 plane, with only limited overlap between malignant and benign cases.
loadings_pc1 = loadings['PC1'].sort_values(ascending=False)
loadings_pc1
mean concave points        0.260854
mean concavity             0.258400
worst concave points       0.250886
mean compactness           0.239285
worst perimeter            0.236640
worst concavity            0.228768
worst radius               0.227997
mean perimeter             0.227537
worst area                 0.224871
mean area                  0.220995
mean radius                0.218902
perimeter error            0.211326
worst compactness          0.210096
radius error               0.205979
area error                 0.202870
concave points error       0.183417
compactness error          0.170393
concavity error            0.153590
mean smoothness            0.142590
mean symmetry              0.138167
worst fractal dimension    0.131784
worst smoothness           0.127953
worst symmetry             0.122905
worst texture              0.104469
mean texture               0.103725
fractal dimension error    0.102568
mean fractal dimension     0.064363
symmetry error             0.042498
texture error              0.017428
smoothness error           0.014531
Name: PC1, dtype: float64
loadings_pc2 = loadings['PC2'].sort_values(ascending=False)
loadings_pc2
mean fractal dimension     0.366575
fractal dimension error    0.280092
worst fractal dimension    0.275339
compactness error          0.232716
smoothness error           0.204430
concavity error            0.197207
mean symmetry              0.190349
mean smoothness            0.186113
symmetry error             0.183848
worst smoothness           0.172304
mean compactness           0.151892
worst compactness          0.143593
worst symmetry             0.141883
concave points error       0.130322
worst concavity            0.097964
texture error              0.089980
mean concavity             0.060165
worst concave points      -0.008257
mean concave points       -0.034768
worst texture             -0.045467
mean texture              -0.059706
perimeter error           -0.089457
radius error              -0.105552
area error                -0.152293
worst perimeter           -0.199878
mean perimeter            -0.215181
worst area                -0.219352
worst radius              -0.219866
mean area                 -0.231077
mean radius               -0.233857
Name: PC2, dtype: float64
top_features_pc1 = loadings_pc1.head(2)
print("Top 2 features for PC1:")
print(top_features_pc1)
top_features_pc2 = loadings_pc2.head(2)
print("\nTop 2 features for PC2:")
print(top_features_pc2)
Top 2 features for PC1:
mean concave points    0.260854
mean concavity         0.258400
Name: PC1, dtype: float64

Top 2 features for PC2:
mean fractal dimension     0.366575
fractal dimension error    0.280092
Name: PC2, dtype: float64
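The same ranking can be generalized to every retained component. A small sketch (the helper name `top_loadings` is made up here; it ranks by absolute loading, so strongly negative loadings such as those on PC2 also surface):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_std = StandardScaler().fit_transform(data.data)
pca = PCA().fit(X_std)
loadings = pd.DataFrame(pca.components_.T[:, :5],
                        columns=[f'PC{i+1}' for i in range(5)],
                        index=data.feature_names)

def top_loadings(loadings, k=2):
    """Features with the k largest absolute loadings per component."""
    return {pc: loadings[pc].abs().nlargest(k).index.tolist()
            for pc in loadings.columns}

print(top_loadings(loadings))
```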
# the complete set of features
sc_X
| mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.097064 | -2.073335 | 1.269934 | 0.984375 | 1.568466 | 3.283515 | 2.652874 | 2.532475 | 2.217515 | 2.255747 | ... | 1.886690 | -1.359293 | 2.303601 | 2.001237 | 1.307686 | 2.616665 | 2.109526 | 2.296076 | 2.750622 | 1.937015 |
| 1 | 1.829821 | -0.353632 | 1.685955 | 1.908708 | -0.826962 | -0.487072 | -0.023846 | 0.548144 | 0.001392 | -0.868652 | ... | 1.805927 | -0.369203 | 1.535126 | 1.890489 | -0.375612 | -0.430444 | -0.146749 | 1.087084 | -0.243890 | 0.281190 |
| 2 | 1.579888 | 0.456187 | 1.566503 | 1.558884 | 0.942210 | 1.052926 | 1.363478 | 2.037231 | 0.939685 | -0.398008 | ... | 1.511870 | -0.023974 | 1.347475 | 1.456285 | 0.527407 | 1.082932 | 0.854974 | 1.955000 | 1.152255 | 0.201391 |
| 3 | -0.768909 | 0.253732 | -0.592687 | -0.764464 | 3.283553 | 3.402909 | 1.915897 | 1.451707 | 2.867383 | 4.910919 | ... | -0.281464 | 0.133984 | -0.249939 | -0.550021 | 3.394275 | 3.893397 | 1.989588 | 2.175786 | 6.046041 | 4.935010 |
| 4 | 1.750297 | -1.151816 | 1.776573 | 1.826229 | 0.280372 | 0.539340 | 1.371011 | 1.428493 | -0.009560 | -0.562450 | ... | 1.298575 | -1.466770 | 1.338539 | 1.220724 | 0.220556 | -0.313395 | 0.613179 | 0.729259 | -0.868353 | -0.397100 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 2.110995 | 0.721473 | 2.060786 | 2.343856 | 1.041842 | 0.219060 | 1.947285 | 2.320965 | -0.312589 | -0.931027 | ... | 1.901185 | 0.117700 | 1.752563 | 2.015301 | 0.378365 | -0.273318 | 0.664512 | 1.629151 | -1.360158 | -0.709091 |
| 565 | 1.704854 | 2.085134 | 1.615931 | 1.723842 | 0.102458 | -0.017833 | 0.693043 | 1.263669 | -0.217664 | -1.058611 | ... | 1.536720 | 2.047399 | 1.421940 | 1.494959 | -0.691230 | -0.394820 | 0.236573 | 0.733827 | -0.531855 | -0.973978 |
| 566 | 0.702284 | 2.045574 | 0.672676 | 0.577953 | -0.840484 | -0.038680 | 0.046588 | 0.105777 | -0.809117 | -0.895587 | ... | 0.561361 | 1.374854 | 0.579001 | 0.427906 | -0.809587 | 0.350735 | 0.326767 | 0.414069 | -1.104549 | -0.318409 |
| 567 | 1.838341 | 2.336457 | 1.982524 | 1.735218 | 1.525767 | 3.272144 | 3.296944 | 2.658866 | 2.137194 | 1.043695 | ... | 1.961239 | 2.237926 | 2.303601 | 1.653171 | 1.430427 | 3.904848 | 3.197605 | 2.289985 | 1.919083 | 2.219635 |
| 568 | -1.808401 | 1.221792 | -1.814389 | -1.347789 | -3.112085 | -1.150752 | -1.114873 | -1.261820 | -0.820070 | -0.561032 | ... | -1.410893 | 0.764190 | -1.432735 | -1.075813 | -1.859019 | -1.207552 | -1.305831 | -1.745063 | -0.048138 | -0.751207 |
569 rows × 30 columns
# using 2 PC as predictors
pca_2X = pcs_df.iloc[:, :2]
pca_2X.head()
| PC1 | PC2 | |
|---|---|---|
| 0 | 9.192837 | 1.948583 |
| 1 | 2.387802 | -3.768172 |
| 2 | 5.733896 | -1.075174 |
| 3 | 7.122953 | 10.275589 |
| 4 | 3.935302 | -1.948072 |
### Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train_2pca, X_test_2pca, _, _ = train_test_split(pca_2X, y, test_size = 0.2, random_state = 42)
X_train.head()
| mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 68 | 9.029 | 17.33 | 58.79 | 250.5 | 0.10660 | 0.14130 | 0.31300 | 0.04375 | 0.2111 | 0.08046 | ... | 10.31 | 22.65 | 65.50 | 324.7 | 0.14820 | 0.43650 | 1.25200 | 0.17500 | 0.4228 | 0.11750 |
| 181 | 21.090 | 26.57 | 142.70 | 1311.0 | 0.11410 | 0.28320 | 0.24870 | 0.14960 | 0.2395 | 0.07398 | ... | 26.68 | 33.48 | 176.50 | 2089.0 | 0.14910 | 0.75840 | 0.67800 | 0.29030 | 0.4098 | 0.12840 |
| 63 | 9.173 | 13.86 | 59.20 | 260.9 | 0.07721 | 0.08751 | 0.05988 | 0.02180 | 0.2341 | 0.06963 | ... | 10.01 | 19.23 | 65.59 | 310.1 | 0.09836 | 0.16780 | 0.13970 | 0.05087 | 0.3282 | 0.08490 |
| 248 | 10.650 | 25.22 | 68.01 | 347.0 | 0.09657 | 0.07234 | 0.02379 | 0.01615 | 0.1897 | 0.06329 | ... | 12.25 | 35.19 | 77.98 | 455.7 | 0.14990 | 0.13980 | 0.11250 | 0.06136 | 0.3409 | 0.08147 |
| 60 | 10.170 | 14.88 | 64.55 | 311.9 | 0.11340 | 0.08061 | 0.01084 | 0.01290 | 0.2743 | 0.06960 | ... | 11.02 | 17.45 | 69.86 | 368.6 | 0.12750 | 0.09866 | 0.02168 | 0.02579 | 0.3557 | 0.08020 |
5 rows × 30 columns
X_train_2pca.head()
| PC1 | PC2 | |
|---|---|---|
| 68 | 4.330003 | 9.202526 |
| 181 | 9.007166 | 0.581031 |
| 63 | -2.314132 | 3.267990 |
| 248 | -2.582556 | 0.729213 |
| 60 | -2.385836 | 2.757658 |
## Training and Predicting
from sklearn.linear_model import LogisticRegression
logmodel_full = LogisticRegression(C=0.001)
logmodel_full.fit(X_train, y_train)  # fit on unscaled features; this triggers the ConvergenceWarning below
//anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
LogisticRegression(C=0.001)
logmodel_2pca = LogisticRegression(C=0.001)
logmodel_2pca.fit(X_train_2pca,y_train)
LogisticRegression(C=0.001)
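One caveat about the workflow above: PCA was fit on the full standardized dataset before the train/test split, and the full-feature model was fit on unscaled data (which is why lbfgs emitted the ConvergenceWarning). A leakage-free alternative would chain scaling, PCA, and the classifier in a Pipeline fit on the training fold only. A sketch, not the workflow actually used above (it also uses default regularization with a higher `max_iter` instead of `C=0.001`):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaler and PCA are fit on the training fold only, then applied to the test fold
pipe = make_pipeline(StandardScaler(), PCA(n_components=2),
                     LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
print("Test accuracy:", round(pipe.score(X_te, y_te), 4))
```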
## Model evaluation
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
y_pred_full = logmodel_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)
auc_full = roc_auc_score(y_test, logmodel_full.predict_proba(X_test)[:, 1])
classification_report_full = classification_report(y_test, y_pred_full)
# Displaying the evaluation metrics
print("Evaluation Metrics for Logistic Regression with Complete Set of Features:")
print("Accuracy:", accuracy_full)
print("AUC Score:", auc_full)
print("Classification Report:")
print(classification_report_full)
Evaluation Metrics for Logistic Regression with Complete Set of Features:
Accuracy: 0.9649122807017544
AUC Score: 0.9990173599737963
Classification Report:
precision recall f1-score support
0 0.98 0.93 0.95 43
1 0.96 0.99 0.97 71
accuracy 0.96 114
macro avg 0.97 0.96 0.96 114
weighted avg 0.97 0.96 0.96 114
y_pred_2pca = logmodel_2pca.predict(X_test_2pca)
accuracy_2pca = accuracy_score(y_test, y_pred_2pca)
auc_2pca = roc_auc_score(y_test, logmodel_2pca.predict_proba(X_test_2pca)[:, 1])
classification_report_2pca = classification_report(y_test, y_pred_2pca)
# Displaying the evaluation metrics
print("\nEvaluation Metrics for Logistic Regression with First Two Principal Components:")
print("Accuracy:", accuracy_2pca)
print("AUC Score:", auc_2pca)
print("Classification Report:")
print(classification_report_2pca)
Evaluation Metrics for Logistic Regression with First Two Principal Components:
Accuracy: 0.8859649122807017
AUC Score: 0.9973796265967901
Classification Report:
precision recall f1-score support
0 1.00 0.70 0.82 43
1 0.85 1.00 0.92 71
accuracy 0.89 114
macro avg 0.92 0.85 0.87 114
weighted avg 0.90 0.89 0.88 114
eval_results = []
for name, y_pred_data in zip(['Original', 'PCA'], [y_pred_full, y_pred_2pca]):
    accuracy = accuracy_score(y_test, y_pred_data)
    # Note: AUC here is computed from hard class labels, so it is lower than
    # the probability-based AUC scores reported above
    auc = roc_auc_score(y_test, y_pred_data)
    classification_report_data = classification_report(y_test, y_pred_data, output_dict=True)
    precision_0 = classification_report_data['0']['precision']
    recall_0 = classification_report_data['0']['recall']
    f1_score_0 = classification_report_data['0']['f1-score']
    support_0 = classification_report_data['0']['support']
    precision_1 = classification_report_data['1']['precision']
    recall_1 = classification_report_data['1']['recall']
    f1_score_1 = classification_report_data['1']['f1-score']
    support_1 = classification_report_data['1']['support']
    eval_results.append({
        'Model': name,
        'Accuracy': accuracy,
        'AUC Score': auc,
        'Precision (Class 0)': precision_0,
        'Recall (Class 0)': recall_0,
        'F1-score (Class 0)': f1_score_0,
        'Support (Class 0)': support_0,
        'Precision (Class 1)': precision_1,
        'Recall (Class 1)': recall_1,
        'F1-score (Class 1)': f1_score_1,
        'Support (Class 1)': support_1
    })
# Create a DataFrame from the evaluation results
eval_df = pd.DataFrame(eval_results)
# Set index to indicate 'Original' and 'PCA'
eval_df.set_index('Model', inplace=True)
# Display the DataFrame
print(eval_df)
Accuracy AUC Score Precision (Class 0) Recall (Class 0) \
Model
Original 0.964912 0.958074 0.97561 0.930233
PCA 0.885965 0.848837 1.00000 0.697674
F1-score (Class 0) Support (Class 0) Precision (Class 1) \
Model
Original 0.952381 43.0 0.958904
PCA 0.821918 43.0 0.845238
Recall (Class 1) F1-score (Class 1) Support (Class 1)
Model
Original 0.985915 0.972222 71.0
PCA 1.000000 0.916129 71.0
Original model: outperforms the PCA model on most metrics, with higher accuracy (0.9649 vs. 0.8860), a good AUC score (0.958), and strong precision and recall for both classes.

PCA model: shows noticeably lower accuracy and AUC. While it achieves perfect precision for class 0 (malignant), its recall for class 0 is considerably lower, suggesting it is overly conservative in labeling instances as malignant. Conversely, it has perfect recall for class 1 (benign), likely because it is too lenient in labeling instances as benign. The class imbalance (71 benign vs. 43 malignant cases in the test set) contributes to this behavior.

Based on these metrics, the original model appears to be the better choice overall: it delivers more balanced and reliable performance across both classes, whereas the PCA model's low class-0 recall points to trouble identifying true malignant cases.
That said, the PCA model runs faster and saves computational resources at training and prediction time. Note, however, that all 30 original features must still be measured to compute the two principal components, so the cost of data collection is not reduced.

Answer: I would still choose the PCA model with 2 components to train on the data.
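Since a single 80/20 split is noisy, the two modeling choices could also be compared with cross-validation, re-fitting the scaler and PCA inside each fold (a sketch; default regularization with a higher `max_iter` is used here rather than the `C=0.001` above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

models = {
    'all 30 features': make_pipeline(StandardScaler(),
                                     LogisticRegression(max_iter=1000)),
    '2 PCs': make_pipeline(StandardScaler(), PCA(n_components=2),
                           LogisticRegression(max_iter=1000)),
}

means = {}
for name, model in models.items():
    # Each fold re-fits the whole pipeline, so no information leaks into the scoring fold
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    means[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```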